import pandas as pd
import numpy as np
from pathlib import Path
import requests
import pandas_datareader as pdr
import yfinance as yf
L07: Pandas I/O
Preliminaries
If you have not done so already (you were asked to do this in lecture01):
- Open a Terminal, type the following command and hit enter:
- pip install yfinance pandas-datareader
- Open a Terminal, type the following command and hit enter:
- conda install -y openpyxl xlrd
Relative vs absolute paths
In this lecture, we will have to tell Python exactly where on our drive it should store some files we create. We can do this by specifying the full path to these locations, for example:
‘C:/Users/ionmi/Dropbox/TEACHING’
The above is an absolute path: it contains the full path to the TEACHING folder on my drive.
An alternative way to specify a path (which we will use very often in this course) is to specify that path relative to the current working directory. To do this, we use a combination of one or more dots (‘.’) and/or slashes (‘/’) which have the following meaning:
‘.’ means the current working directory (in our case, this is the directory where these lecture notes are stored on the drive).
‘..’ means the parent of the current working directory.
‘../..’ means the parent of the parent of the current working directory.
‘../../..’ means the parent of the parent of the parent of the current working directory.
etc.
To see this in practice, we can use the Path class in the pathlib package, which allows us to see the absolute path of a given relative path:
Path('.').resolve() #this will give you a different output than mine
PosixPath('/data/imb/Dropbox/0teaching/UA_2023_Spring_525/lectures/lecture08_pandas_io')
Path('..').resolve()
PosixPath('/data/imb/Dropbox/0teaching/UA_2023_Spring_525/lectures')
Path('../..').resolve()
PosixPath('/data/imb/Dropbox/0teaching/UA_2023_Spring_525')
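As an aside, Path objects can also be combined using the / operator, which makes it easy to build relative paths in code. A small sketch (the data folder and file name here are hypothetical, just for illustration):

(Path('..') / 'data' / 'myfile.csv').resolve() #builds '../data/myfile.csv' and shows its absolute path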
Reading and writing .pkl (pickle) files
Python has its own binary data format called “pickle”. These types of files have the extension “.pkl”. Saving and loading data from pickle files is significantly faster than from/to “.csv”, so we will be using it quite a bit throughout the course.
df = pd.DataFrame(data=np.random.rand(5,4), columns = list('ABCD')) #you'll get different numbers every time you re-run this
df
| | A | B | C | D |
|---|---|---|---|---|
0 | 0.377738 | 0.380756 | 0.822901 | 0.685419 |
1 | 0.134739 | 0.608980 | 0.171573 | 0.064065 |
2 | 0.208184 | 0.155670 | 0.083572 | 0.922087 |
3 | 0.796854 | 0.193076 | 0.262668 | 0.639699 |
4 | 0.927714 | 0.290597 | 0.763463 | 0.390035 |
.to_pickle()
To store this data in a .pkl file, we use the “.to_pickle()” function, applied right after the name of the dataframe which contains the data we want to store.
Syntax:
DataFrame.to_pickle(path, compression='infer', protocol=5, storage_options=None)
Note that the first argument (path) is mandatory (it has no default value). This argument is where you specify the name of the .pkl file you want to create (mydata.pkl below) and the location (directory) where this file should be stored (. below), all in a single string, separated by /.
df.to_pickle('./mydata.pkl')
Note that we can also compress the file:
df.to_pickle('./mydata.zip')
.read_pickle()
To read data from an existing .pkl file, we use the “.read_pickle()” function, specifying as an argument the path to the file we want to read (including its name):
Syntax:
pandas.read_pickle(filepath_or_buffer, compression='infer', storage_options=None)
For example, if we want to read the contents of the .pkl file we just created above, and store those contents into a new variable df2, we would use:
df2 = pd.read_pickle('./mydata.pkl')
df2
| | A | B | C | D |
|---|---|---|---|---|
0 | 0.377738 | 0.380756 | 0.822901 | 0.685419 |
1 | 0.134739 | 0.608980 | 0.171573 | 0.064065 |
2 | 0.208184 | 0.155670 | 0.083572 | 0.922087 |
3 | 0.796854 | 0.193076 | 0.262668 | 0.639699 |
4 | 0.927714 | 0.290597 | 0.763463 | 0.390035 |
And we can read compressed .pkl files too:
df2 = pd.read_pickle('./mydata.zip')
df2
| | A | B | C | D |
|---|---|---|---|---|
0 | 0.377738 | 0.380756 | 0.822901 | 0.685419 |
1 | 0.134739 | 0.608980 | 0.171573 | 0.064065 |
2 | 0.208184 | 0.155670 | 0.083572 | 0.922087 |
3 | 0.796854 | 0.193076 | 0.262668 | 0.639699 |
4 | 0.927714 | 0.290597 | 0.763463 | 0.390035 |
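If you want to verify the speed claim from the start of this section on your own machine, here is a rough benchmark sketch (exact timings will differ from machine to machine; the big.pkl and big.csv file names are just for illustration):

import time
big = pd.DataFrame(np.random.rand(1_000_000, 4), columns = list('ABCD')) #a larger dataframe so the difference is visible
t0 = time.perf_counter()
big.to_pickle('./big.pkl')
t1 = time.perf_counter()
big.to_csv('./big.csv')
t2 = time.perf_counter()
print(f'pickle write: {t1 - t0:.2f}s, csv write: {t2 - t1:.2f}s')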
Note a very important difference in how we use the two functions above. The syntax for .to_pickle() starts with DataFrame.to_pickle, which tells us that the function must be applied to an existing DataFrame. On the other hand, the syntax for .read_pickle() starts with pandas.read_pickle, which we shortened to pd.read_pickle because we imported pandas as pd in the first code cell of this notebook (at the top).
This pattern is the same for all the read-write functions we discuss in this lecture: the write functions (.to_pickle(), .to_csv(), .to_excel()) are written after the name of the dataframe we want to write to a file, while the read functions (.read_pickle(), .read_csv(), .read_excel()) follow the name of the pandas package (which we imported as pd above).
Reading and writing .txt and .csv files
The most common way to read and write dataframes from/to .csv and .txt files is with the Pandas functions “.to_csv()” (for writing) and “.read_csv()” (for reading).
.to_csv()
Here is the abbreviated version of the syntax for .to_csv(), excluding parameters that are not used as often:
DataFrame.to_csv(path_or_buf=None, sep=',', columns=None, header=True, index=True, index_label=None)
Note that the default separator is a comma (sep=','), which means we can omit that parameter when we write .csv files. The columns parameter allows you to specify which columns of the dataframe you want to write to the .csv file.
df.to_csv('./mydata.csv', columns = ['B','C'])
To write tab-delimited .txt files, we use sep='\t' and change the file extension to .txt:
df.to_csv('./mydata.txt', sep='\t')
To write space-delimited .txt files, we use sep=' ' (though I always recommend using tabs for .txt files):
df.to_csv('./mydata_space.txt', sep=' ')
.read_csv()
Here is the abbreviated version of the syntax for .read_csv(), excluding parameters that are not used as often:
pandas.read_csv(filepath_or_buffer, sep=',', header='infer', names=None, index_col=None, usecols=None, nrows=None, skiprows=None)
Note that the default separator is a comma (sep=','), which means we can omit that parameter when we read .csv files:
df3 = pd.read_csv('./mydata.csv')
df3
| | Unnamed: 0 | B | C |
|---|---|---|---|
0 | 0 | 0.380756 | 0.822901 |
1 | 1 | 0.608980 | 0.171573 |
2 | 2 | 0.155670 | 0.083572 |
3 | 3 | 0.193076 | 0.262668 |
4 | 4 | 0.290597 | 0.763463 |
Note, however, that we did not specify that the first column is just an index for the table, so that first column was included as data in the table itself. Note also that .read_csv() guessed that the column names are on the first row, because the default value of the header parameter is 'infer'. To be safe, I always recommend being explicit about where the row names and column names are (remember, Python starts counting from 0):
df3 = pd.read_csv('./mydata.csv', header = 0, index_col = 0)
df3
| | B | C |
|---|---|---|
0 | 0.380756 | 0.822901 |
1 | 0.608980 | 0.171573 |
2 | 0.155670 | 0.083572 |
3 | 0.193076 | 0.262668 |
4 | 0.290597 | 0.763463 |
To read tab-delimited .txt files, we use sep='\t':
df4 = pd.read_csv('./mydata.txt', sep='\t', header = 0, index_col = 0)
df4
| | A | B | C | D |
|---|---|---|---|---|
0 | 0.377738 | 0.380756 | 0.822901 | 0.685419 |
1 | 0.134739 | 0.608980 | 0.171573 | 0.064065 |
2 | 0.208184 | 0.155670 | 0.083572 | 0.922087 |
3 | 0.796854 | 0.193076 | 0.262668 | 0.639699 |
4 | 0.927714 | 0.290597 | 0.763463 | 0.390035 |
To read space-delimited .txt files, we use sep=' ':
df5 = pd.read_csv('./mydata_space.txt', sep=' ', header = 0, index_col = 0)
df5
| | A | B | C | D |
|---|---|---|---|---|
0 | 0.377738 | 0.380756 | 0.822901 | 0.685419 |
1 | 0.134739 | 0.608980 | 0.171573 | 0.064065 |
2 | 0.208184 | 0.155670 | 0.083572 | 0.922087 |
3 | 0.796854 | 0.193076 | 0.262668 | 0.639699 |
4 | 0.927714 | 0.290597 | 0.763463 | 0.390035 |
The .to_csv() and .read_csv() functions have a lot more useful parameters. In the practice problems for this lecture, you will be asked to investigate some of them on your own by reading the official documentation:
- .to_csv(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
- .read_csv(): https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
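For example, here is a small sketch of two of the less common .read_csv() parameters from the syntax above, usecols and nrows, applied to the mydata.csv file we created earlier (it keeps only column B and reads only the first 3 data rows; df7 is a new variable name used just for this example):

df7 = pd.read_csv('./mydata.csv', usecols = ['B'], nrows = 3)
df7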
Reading and writing .xlsx files
We’ll use .to_excel() to write Excel files and .read_excel() to read Excel files. The biggest difference from the .csv functions is that, with the Excel ones, you can specify a particular sheet of the Excel file that you want to read/write.
.to_excel()
Here is an abbreviated version of the syntax for .to_excel():
DataFrame.to_excel(excel_writer, sheet_name='Sheet1', columns=None)
df.to_excel('./mydata_excel.xlsx', sheet_name = 's1', columns = ['A', 'C'])
.read_excel()
Here is an abbreviated version of the syntax for .read_excel():
pandas.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, skiprows=None, nrows=None)
df6 = pd.read_excel('./mydata_excel.xlsx', sheet_name = 's1', header = 0, index_col = 0, engine = 'openpyxl')
df6
| | A | C |
|---|---|---|
0 | 0.377738 | 0.822901 |
1 | 0.134739 | 0.171573 |
2 | 0.208184 | 0.083572 |
3 | 0.796854 | 0.262668 |
4 | 0.927714 | 0.763463 |
The .to_excel() and .read_excel() functions have a lot more useful parameters. In the practice problems for this lecture, you will be asked to investigate some of them on your own by reading the official documentation:
- .to_excel(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
- .read_excel(): https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
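One thing the .csv functions cannot do is store several tables in one file. As a sketch of how this works with Excel files, we can write the same dataframe to two different sheets using pandas' ExcelWriter, and then read all sheets back at once with sheet_name=None, which returns a dictionary of dataframes keyed by sheet name (the two_sheets.xlsx file name and the sheet names are just for illustration):

with pd.ExcelWriter('./two_sheets.xlsx') as writer:
    df.to_excel(writer, sheet_name = 's1')
    df.to_excel(writer, sheet_name = 's2')
all_sheets = pd.read_excel('./two_sheets.xlsx', sheet_name = None, index_col = 0) #dictionary: {'s1': ..., 's2': ...}
all_sheets['s2']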
Some important data-acquisition packages
The “requests” package
The requests package allows us to retrieve data from websites. If you want a more detailed discussion of the full functionality of this package, see the documentation at https://docs.python-requests.org/en/latest/.
Here, we’ll just see how we can use the package to download data from files hosted on websites. For this, we need the URL of the file we want to download. In the example below, I use data on economic policy uncertainty in the US from this website:
https://www.policyuncertainty.com/us_monthly.html
If you right-click on the “Download Data” link and select “Copy Link Address”, you should see the link below when you paste it in your code:
= "https://www.policyuncertainty.com/media/US_Policy_Uncertainty_Data.xlsx" url
We use the .get() function to retrieve the (binary) data from the URL above:
r = requests.get(url)
We can check whether the request was successful using the status_code attribute: 200 means the request succeeded, while codes like 404 (“not found”) indicate an error.
r.status_code
200
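As an alternative to checking status_code manually, requests also provides the .raise_for_status() method, which raises an exception for a failed request and does nothing if the request succeeded:

r.raise_for_status() #no output here, because our request succeeded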
To write the data we retrieved into an Excel file on our computer, we use the Python built-in open() function, specifying the path to the file we want to write the data to ('./policy_uncertainty.xlsx' below) and specifying that we are writing binary data to it ('wb' below):
outfile = open('./policy_uncertainty.xlsx', 'wb')
type(outfile)
_io.BufferedWriter
Now we can write the data into that file using the file object's write() function and then close the file with its close() function. Note that the actual data from the URL above is found under the content attribute of the request r that we created above.
outfile.write(r.content)
outfile.close()
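As an aside, an equivalent (and generally safer) way to perform these two steps is with a with block, which closes the file automatically even if an error occurs while writing:

with open('./policy_uncertainty.xlsx', 'wb') as outfile:
    outfile.write(r.content)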
We can check whether this process worked either by manually opening policy_uncertainty.xlsx in our working directory, or by using pd.read_excel() to read the data into a dataframe and take a look at it:
df = pd.read_excel('./policy_uncertainty.xlsx').dropna() #.dropna() means drop rows with missing values
df
| | Year | Month | Three_Component_Index | News_Based_Policy_Uncert_Index |
|---|---|---|---|---|
0 | 1985 | 1.0 | 125.224740 | 103.748803 |
1 | 1985 | 2.0 | 99.020809 | 78.313193 |
2 | 1985 | 3.0 | 112.190509 | 100.761475 |
3 | 1985 | 4.0 | 102.811319 | 84.778863 |
4 | 1985 | 5.0 | 120.082716 | 98.053653 |
... | ... | ... | ... | ... |
452 | 2022 | 9.0 | 174.183475 | 201.738489 |
453 | 2022 | 10.0 | 177.423127 | 207.275335 |
454 | 2022 | 11.0 | 171.737257 | 210.374894 |
455 | 2022 | 12.0 | 136.502767 | 150.156098 |
456 | 2023 | 1.0 | 143.894044 | 162.788431 |
457 rows × 4 columns
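As a sketch of a typical next step (not required here), we could combine the Year and Month columns into a proper datetime index; one simple way is to build date strings and pass them to pd.to_datetime:

df.index = pd.to_datetime(df['Year'].astype(int).astype(str) + '-' + df['Month'].astype(int).astype(str) + '-01') #e.g. '1985-1-01'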
The “pandas_datareader” package
The pandas_datareader package allows us to download data from many different sources on the internet. Here is a list of all these sources:
https://pandas-datareader.readthedocs.io/en/latest/readers/index.html
The general syntax to download data from a particular source is as follows:
Syntax:
pandas_datareader.DataReader(name, data_source=None, start=None, end=None)
The two sources I will cover here are the St. Louis Federal Reserve Economic Data (FRED) (data_source = 'fred'), which contains a lot of useful macroeconomic data, and the Fama-French data in Ken French's Data Library (data_source = 'famafrench'), which contains returns on many portfolios commonly used in asset pricing (e.g. the market portfolio, SMB, HML, etc.).
For both of these sources, we use the name parameter to specify what exactly we want to download.
For example, to download data on the CPI from FRED, we need to use name = 'CPIAUCSL', which is the internal name that FRED uses for the CPI data:
https://fred.stlouisfed.org/series/CPIAUCSL
cpi = pdr.DataReader(name = 'CPIAUCSL', data_source = 'fred', start = '2020-09-01', end = '2020-12-31')
cpi
| DATE | CPIAUCSL |
|---|---|
2020-09-01 | 260.190 |
2020-10-01 | 260.352 |
2020-11-01 | 260.721 |
2020-12-01 | 261.564 |
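Because CPI levels themselves are rarely the object of interest, here is a quick sketch of how you might turn them into monthly inflation rates using pandas' .pct_change() method:

inflation = cpi['CPIAUCSL'].pct_change().dropna() * 100 #percent change from the previous month
inflation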
To download data on the Fama-French three risk factors (market, SMB, and HML), we use name = 'F-F_Research_Data_Factors', which is the name of the text file containing these factors on Ken French's website:
https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
ff3f = pdr.DataReader(name = 'F-F_Research_Data_Factors', data_source = 'famafrench', start = '2020-09-01', end = '2020-12-31')
ff3f
{0: Mkt-RF SMB HML RF
Date
2020-09 -3.63 0.04 -2.68 0.01
2020-10 -2.10 4.36 4.21 0.01
2020-11 12.47 5.82 2.14 0.01
2020-12 4.63 4.89 -1.51 0.01,
1: Mkt-RF SMB HML RF
Date
2020 23.66 13.17 -46.57 0.45,
'DESCR': 'F-F Research Data Factors\n-------------------------\n\nThis file was created by CMPT_ME_BEME_RETS using the 202212 CRSP database. The 1-month TBill return is from Ibbotson and Associates, Inc. Copyright 2022 Kenneth R. French\n\n 0 : (4 rows x 4 cols)\n 1 : Annual Factors: January-December (1 rows x 4 cols)'}
Note that for the ‘famafrench’ data source, the DataReader function returns a dictionary of dataframes, not a single pandas dataframe. That’s because the ‘F-F_Research_Data_Factors’ file contains multiple tables. The monthly returns on the Fama-French risk factors are in the first entry in that dictionary (the 0 key), so we can retrieve them like this:
ff3f[0]
| Date | Mkt-RF | SMB | HML | RF |
|---|---|---|---|---|
2020-09 | -3.63 | 0.04 | -2.68 | 0.01 |
2020-10 | -2.10 | 4.36 | 4.21 | 0.01 |
2020-11 | 12.47 | 5.82 | 2.14 | 0.01 |
2020-12 | 4.63 | 4.89 | -1.51 | 0.01 |
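The 'DESCR' key in that same dictionary contains a description of all the tables, which is useful when you do not know what the other keys hold:

print(ff3f['DESCR'])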
There is no easy way to know under what name you can find the data you need. You have to look at the FRED and Fama-French websites first, to see what names those websites use for the data you need, and then type those names into your code, like we did above.
The “yfinance” package
The yfinance package allows us to retrieve stock price data from Yahoo Finance. The full documentation for the package can be found here: https://pypi.org/project/yfinance/ (especially, look under “Fetching data for multiple tickers” on the main page).
Syntax:
yfinance.download(tickers, start = None, end = None, interval = '1d')
For example, to retrieve monthly stock prices for Microsoft and Apple, we need to supply their tickers in a single string (separated by a space) as the first parameter to the download function and change the interval parameter to '1mo' (otherwise it will give us daily data):
= yf.download(tickers = "MSFT AAPL", start = '2020-09-01', end = '2020-12-31', interval = '1mo')
prc prc
[*********************100%***********************] 2 of 2 completed
| Date | Adj Close AAPL | Adj Close MSFT | Close AAPL | Close MSFT | High AAPL | High MSFT | Low AAPL | Low MSFT | Open AAPL | Open MSFT | Volume AAPL | Volume MSFT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
2020-09-01 | 114.239166 | 206.105713 | 115.809998 | 210.330002 | 137.979996 | 232.860001 | 103.099998 | 196.250000 | 132.759995 | 225.509995 | 3.885245e+09 | 768176300.0 |
2020-10-01 | 107.383453 | 198.403580 | 108.860001 | 202.470001 | 125.389999 | 225.210007 | 107.720001 | 199.619995 | 117.639999 | 213.490005 | 2.894666e+09 | 631618000.0 |
2020-11-01 | 117.435226 | 209.770584 | 119.050003 | 214.070007 | 121.989998 | 228.119995 | 107.320000 | 200.119995 | 109.110001 | 204.289993 | 2.123077e+09 | 573443000.0 |
2020-11-06 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2020-11-18 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2020-12-01 | 131.116043 | 218.523499 | 132.690002 | 222.419998 | 138.789993 | 227.179993 | 120.010002 | 209.110001 | 121.010002 | 214.509995 | 2.322190e+09 | 594761700.0 |
We will always drop missing values (with .dropna()) and use the Adj Close prices (prices adjusted for dividends and splits):
aprc = prc['Adj Close'].dropna()
aprc
| Date | AAPL | MSFT |
|---|---|---|
2020-09-01 | 114.239166 | 206.105713 |
2020-10-01 | 107.383453 | 198.403580 |
2020-11-01 | 117.435226 | 209.770584 |
2020-12-01 | 131.116043 | 218.523499 |
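A quick sketch of a typical next step: computing monthly returns from these adjusted prices with .pct_change() (this is not needed for the rest of this lecture):

ret = aprc.pct_change().dropna() #simple returns: (P_t - P_{t-1}) / P_{t-1}
ret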
Note that, if we download data for a single stock, this will return a pandas Series, not a DataFrame:
= yf.download(tickers = "MSFT", start = '2020-09-01', end = '2020-12-31', interval = '1mo')['Adj Close'].dropna()
prc2 prc2
[*********************100%***********************] 1 of 1 completed
Date
2020-09-01 206.105698
2020-10-01 198.403580
2020-11-01 209.770599
2020-12-01 218.523514
Name: Adj Close, dtype: float64
type(prc2)
pandas.core.series.Series
As mentioned before, we will usually turn Series into dataframes before continuing to work with them further:
aprc2 = prc2.to_frame()
aprc2
| Date | Adj Close |
|---|---|
2020-09-01 | 206.105698 |
2020-10-01 | 198.403580 |
2020-11-01 | 209.770599 |
2020-12-01 | 218.523514 |
The “wrds” package
The wrds package allows us to download data directly from the WRDS database. Unfortunately, this functionality is not available for class accounts like the ones I created for this course. I mention this package only for PhD students, who should be able to use it with their own individual WRDS credentials. The documentation for this package is found here:
https://pypi.org/project/wrds/
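For those of you with individual WRDS credentials, here is a minimal sketch of what using the package looks like (the query below assumes your account has access to the CRSP monthly stock file; it is just for illustration):

import wrds
db = wrds.Connection() #will prompt for your WRDS username and password
msf = db.raw_sql('select permno, date, ret from crsp.msf limit 10') #a small sample query
db.close()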